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^ Abstract. We propose a fast algorithm for mode rank truncation of the result 

^ of a bilinear operation on 3-tensors given in the Tucker or canonical form. If the 

^ arguments and the result have mode sizes n and mode ranks r, the computation 

costs 0(nr^ + r'^). The algorithm is based on the cross approximation of Gram 
^ matrices, and the accuracy of the resulted Tucker approximation is hmited by square 

^ root of machine precision. 

^ Keywords: Multidimensional arrays, structured tensors. Tucker approximation, 

fast compression, cross approximation 
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> 1 Introduction 

ON 

^ Data sparse representations of tensors and efficient operations in the corresponding 

formats play increasingly important role in many applications. In the paper we consider 
O ^ 3-tensor A = A[i, j,k] that is an array with three indices. The number of allowed 

2 values of each index is called mode size. To specify tensor indices explicitly, we use 

square brackets. This notation allows to easily specify different index transformations. 
For instance, unfoldings of rii x 1x2 x tls tensor A[t,j,k], are matricizations of sizes 
Tti X n2Tt3, 112 X TtiTi3 and 1x3 X nin2 that consist of columns, rows and tube fibers of A, 

A'^' = A[i,jlc], A^^' = A[j,ki], Af3^ = A[k,tj]. (1) 

Here we set row/column/fiber index of the tensor A[i, j, k] as row index and join the two 
others in one multiindex for columns of the unfolding. The result is considered as a two- 
index object (matrix), with row and column indices separated by comma. The difference 

*This work was supported by RFBR grants 08-01-00115, 09-01-12058, 10-01-00757, 10-01-09201, 
RFBR/DFG grant 09-01-91332, Russian Federation Gov. contract 17940 and Priority research program 
of Dep. Math. RAS. 
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between matrices and tensors is additionally stressed by use of uppercase letter instead 
of bold uppercase. The reshape of tensor elements assumes as well a change of the index 
ordering. For example, transposition of matrix reads (A[i,j])^ — A[j,i], vectorization 
reads a[ij] = A[i, j]. We see that the square bracket notation is rather self-explaining and 
suits for description of algorithms working with multidimensional data. We also will 
use the MATLAB-style round bracket notation a(i, j,k) to point to individual element 
of A[i, j,k] and a(i, :,k) to select a mode vector (i.e. row) from tensor A. Scalars and 
vectors are denoted by lowercase and bold lowercase letters. 

In numerical work with tensors of large mode size it is crucial to use data sparse 
formats. For 3-tensors, the most useful are the following. 

The canonical decomposition [20, 2, 19] (or canonical approximation to some 
other tensor) reads 

R R 

A[i,j,k] = ^ Us[i] ® Vs[j] » Ws[k], a(i,j,k) = ^u(i,s)v(j,s)w(k,s). (2) 

s=l s=l 

The minimal possible number of summands is called tensor rank or canonical rank of 
the given tensor A. However, canonical decomposition/approximation of a tensor with 
minimal value of R is a rather ill-posed and computationally unstable problem [8]. This 
explains why among many algorithms of canonical approximation (cf. [4, 7, 11, 25]) 
none is known as absolutely reliable, and no robust tools for linear algebra operations 
maintaining the canonical format (linear combinations, etc.) are proposed. 
The (truncated) Tucker decomposition/ approximation [28] reads 

A[i,j,k] = G[p,q,s] xi U[i,p] X2 V[j,q] X3 W[k,s], 

!■! rz T3 

= ^^^g(p,q,s)u(i,p)v(j,q)w(k,s). 

p=l q=l s=l 

The quantities ri,r2,r3 are referred to as Tucker ranks or mode ranks, the tensor 
G = G[p, q, s] of size ti x x is called the Thicker core, the symbol Xi designates the 
multiplication of a tensor by a matrix along the l-th mode, the mode factors U, V, W 
have orthonormal columns. In d dimensions, the memory to store r x r x . . . x r core is r'^, 
that is usually beyond affordable for large d and even for very small r (so-called curse 
of dimensionality). For d = 3, r ~ 100 the storage is small and Tucker decomposition 
can be used efficiently. 

In [26] the efficient operations with 3-tensors in canonical and Tucker formats are 
discussed, with approximation of the result in the Tucker format. Simple operations like 
linear combination of small number of structured tensors can be done using multilinear 
SVD [5] (or high-order SVD, HOSVD), with quasi-optimal ranks and guaranteed accu- 
racy. Linear combination of many tensors, convolution, Hadamard (pointwise) product 
of tensors and many other bilinear operations reduce to recompression of the following 
structured tensor 

F[i, j, k] = Kron(G, H)[ap, bq, cs] Xi U[i, ap] X2 V[j, bq] X3 W[k, cs], 

f(i-,j,l<) =_^^g(p,q,s)H(a,b,c)u(i, ap)v(j,bq)w(k,cs), 

pqs abc 
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with Ti X r2 X r3 core G[p, q,s], pi x p2 x p3 core H[a, b,c] and non- orthogonal factors 
U, V and W. Formally (4) is a Tucker-like format with larger mode ranks piTi , P2T2, P3T3, 
that should be reduced (truncated) maintaining the desired accuracy. Due to memory 
limitations, F[i, j,k] can not be assembled for mode sizes n > 10^ and auxiliary piTi x 
P2T2 X Pal's core can not be assembled for ranks r > 30 (see Tab. 1).^ The structure of 
F should be exploited without explicit evaluation of large temporary arrays. 

A practical rank-reduction algorithm proposed in [26] is a rank revealing version of 
iterative Tucker- ALS [22, 6] requiring 0{nr^ + r^] operations. However, the number of 
iterations in Tucker- ALS depends on the initial guess, and fast approximate evaluation 
of Tucker factors of (4) is important. 

In Sec. 2 we propose to approximate dominant mode subspaces of F[i, j,k] by the 
ones of simpler tensors. In Sec. 3 we compute dominant mode subspaces by a cross 
approximation of Gram matrices of the unfoldings. The resulted algorithm requires 
0[nr^ + r^) operations in three-dimensional case and can be easily generalized to higher 
dimensions using 0(dnr^ + dr'^+^) operations. Since it uses decomposition of Gram 
matrices, the accuracy is limited by square root of machine precision. In Sec. 4 we apply 
the proposed method to Hadamard product of electron densities of simple molecules 
and show that using the result as an initial guess, Tucker-ALS converges to almost 
machine precision in one iteration. 

In the paper we use Frobenius norm of tensors, that is defined as follows 

II A||^ (A, A) , (A, B) "^^^Y. ^y^bii^ 

i=i j=i k=i 

and spectral norm of tensor (cf. [9]) 

||A||2*= max A Xi X2 X3 = max (A, u®v®w), 

||ul|=||v||=||w||=l ||u||=l|v||=||w||=l 

induced by standard vector norm ||u||^ =^ ||u||2 = (u,u) = Y.^=] I'l^iP- 

*We always assume Ui = Ui = 113 = n and ri = T2 = T3 = pi = p2 = pa = r in complexity estimates 
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2 Approximation of dominant subspaces 



Our goal is to find approximate dominant subspaces of an ni x ni x ria tensor (4) 
producing an approximation in the Tucker format 

F[i,j,k] =T[a,|3,y] xi X[i,a] X2Y[j,|3] X3Z[k,y], ||F - F||f ^ £||F||f (5) 

with a desired (not very high) accuracy and values of mode ranks for F, close to optimal. 

Tucker factors X[i, oc], Y[j, |3] and Z[k, y] approximate dominant subspaces of rows, 
columns and fibers of F[i, j,k], respectively. They can be computed by SVD of the 
unfoldings of F, as proposed in [5], but this method requires evaluation of all elements 
of tensor and is not feasible for large mode sizes. We can compute (5) interpolating a 
given tensor on carefully selected set of elements. This is done in CrossSD algorithm [23], 
that requires evaluation of 0(nr + r^) tensor elements and uses 0[nr^ + r^) additional 
operations. For a structured tensor (4) this summarizes to 0[nr^ + r^) operations, i.e. 
the complexity is linear in mode size. However, pivoting and error checking involves 
heuristics and in certain cases is slower than the approximation itself. For example, 
computation of residual (F — F)[i, j,k] on 0[n] randomly picked elements uses 0(rLr'*) 
operations. 

To avoid heuristic approaches, we can evaluate dominant subspaces by proper de- 
composition of Gram matrices of the unfoldings. In [27] this idea was used for fast 
mode rank truncation of tensor given in the canonical form (2) with large number of 
terms. The proposed in [27] cross approximation algorithm is equivalent to an unfin- 
ished Cholesky decomposition and computes rank-r dominant basis using the diagonal 
and certain r columns of the Gram matrix. However, for the unfolding F[i, jk] of tensor 
F[i, j,k] the Gram matrix (FF^)[i, i'] reads 

(FFT)(i,i')=^}^ Y_ }^g(p,q,s)Ma,b,c)g(p',q',s')Ma',b',c') 

pqs abc p'q's' a'b'c' (6) 

(VTV)(bq,b'q')(WTw)(cs, c's')u(i, ap)u(i', a'p'), 

and it is easy to check, that evaluation of any element of (6) requires O(r^) operations. 
Therefore, the algorithm from [27] applied to (6) has O(rLr^) complexity, which is not 
promising even for moderate r. To perform faster, we propose to change the computa- 
tional objective and look for dominant subspaces of tensors with a simpler structure. 
Rewrite the tensor (4) as follows 

F[i,j,k] =U'[i,bq,cs] X2 V[j,bq] X3 W[k,cs], 

U'[i,bq,cs] = Kron(G, H)[ap,bq, cs] Xi U(i, ap). 

It is clear that the Tucker approximation of U'[i, bq, cs] gives a Tucker approximation of 
F[i, j,k] with the same mode-1 rank. Therefore, we can approximate dominant mode-1 
subspace of F by the one of U'. The accuracy of resulted approximation is estimated by 
the following theorem. 
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Theorem 1. For tensor F[i, j,k] given by (7) it holds 

||F||f^ ||U'||f||V||2||W||2, ||F||2^ ||U'||2||v||2||vv||2, 
and for mode-1 unfoldings F = F[i, jk] and U' = U'[i, pqcs] it holds 

||F||2^||U'||2||V||2||W1|2. 

Proof. The first and last parts follow directly from matrix inequalities 

||F[i,j,k]||p = ||F[i,jk]||F ^ ||U'[i,bqcs]|H|W[k,cs] « V[j,bq]||2 = ||U'||f||V||2||W||2, 
||F[i,jk]||2 ^ ||U'[i,bqcs]||2||W[k,cs]«V[j,bq]||2= ||U'[i,bqcs]||2||V||2||W||2. 

Second part reads 

||F[i,j,k]||2 = max (U'[i,pq,cs] X2 V[j,pq] X3 W[k,cs], u[i] ®v[j] ®w[k]) = 

ll"ll=lMI=||wi|=i 

max (v^V)[bq](U' xi u^)[bq,cs](W^w)[cs] ^ 

||u||=||v||=||w||=l 

^ max IIV^vllllU' xi u^||2||W"^w|| = 

|u||=||v||=||w||=l 

= ( maxllU' xi u^||2 1 ( max||VTv|| ) ( max llW^wll ) = IIU'I 

Viiuihi J ViMi=i 7 ViiHhi 7 



□ 



Corollary 1. For certain perturbation AU' of tensor U', the corresponding perturbation 
AF can be estimated as follows 

I|AF||f ^ IIAU'IIf IIU'IIf 



|F||f IIU'IIf ' IIFIIf 



IIAFII2 IIAU-II2 I|U1|2||V||2|,..|K . 

IIAFII2 ^ ||AU'||2 ||U'||2||V||2||W||2 
^ C2 , C2 = 



IIFII2 ||U'||2 IIFII2 

Remark 1. For any tensor, || A[i, j, k] II2 ^ ||A[i,jk]||2 ^ || A[i, j, k] ||f. 

To find a dominant mode-1 subspace of U'[i, bq,cs], we can use proper decomposi- 
tion of Gram matrix of the unfolding U'[i, bqcs], that reads 

A[i,i'] = (U'U'"')[i,i'] = U[i,ap] (G[p,pVH[a,a']) U[a'p',i'], 
G[p,p'] = G[p,qs]G[qs,p'], H[a, a'] = H[a, bc]H[bc, a']. 

Tensor U' has a simpler structure than F, and computation of the Gram matrix (9) is 
faster than (6). However, evaluation of A[i, i'] as full ni x tli array leads to O(n^r^) 
complexity. Looking for the methods with linear in mode size complexity, we are to use 
the cross approximation algorithms. 
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3 Cross approximation of Gram matrices 



Truncated singular /proper decomposition is used in cases where low-rank approxima- 
tion is required. This problem can be solved by faster methods, for example, those 
based on cross approximation A[i, j] ^ A[i, j] = U[i, J](A[I, J])^^ A[I, j], where I and J 
contain indices of certain rows and columns of A. This approximation is exact on the 
cross formed by rows I and columns J, but the overall accuracy depends heavily on the 
properties of A[I, J]. In [17, 18, 16] it is shown that a good choice for A[I, J] is maxi- 
mum volume (modulus of determinant) submatrix. Search of this submatrix in general 
case is NP-hard problem [1], and alternatives should be used, see [29, 13]. If the sup- 
ported cross is iteratively widened at each step by one row and column that intersect on 
element where residual is maximum in modulus, cross approximation method is equiv- 
alent to Gaufiian decomposition with complete pivoting. For Gram matrix the pivot is 
always on the diagonal and cross approximation is equivalent to unfinished Cholesky 
decomposition. The resulted algorithm exploiting structure of (9) is summarized in 
Alg. 1. 

Algorithm 1: Cross approximation for Gram matrix (9) 

Input: Structured tensor F = Kron(G, H) x i U Xi V X3 W, see (4) 
Output: Approximation A = XAX^ for Gram matrix (9), such that ||A 
Initialization: p = 0, A = 

1: G[p,p'] = G[p,qs]G[qs,p'], H[a, a'] = H[a, bc]H[bc, a'] 
2: for i = 1 , . . . , n do {Compute diagonal of matrix} 
3: Ut[a,p] =U[i,ap], d(i) = (Ut[a,p]G[p,p'], H[a, a']Ut[a',p']) 
4: end for 
5: nrm := ||d||i 
6: repeat 

7: i^, := argmaxi |d(i)| {Find new pivot} 

8: a(:,ij :=U[:,ap](H[a,a']UiJa',p']G[p',p])[ap] 

9: a(:,ij = XA(x( U,:))T 
10: X* := {a-a)/^^{a- a)(i*,u) 
11: d[i] := d[i] — |x^[i]p {Update diagonal of residual} 
12: x^ =: [Xx']b {Orthogonalize x^ to spanX} 
13: A + b^b =: VDV^ {Re-diagonalize decomposition} 
14: X:=[Xx']V, A:=D, A = XAX^, err := ||d||i 
15: until err ^ £ nrm or r = Tmax 



It is easy to see that evaluation of G and H, i.e. Gram matrices of the unfoldings 
G[p,qs] and H[a, be], requires 0[r^] operations in three-dimensional case and 0(r'^+^) 
in d-dimension case. With precomputed G and H every element a(i, i') is computed 
in O(r^) operations and a column a(:,i') is computed in 0(nr^ + r^) operations for 
three and d-dimensional case. For the rediagonalization of A + b^b matrix we can 
use algorithm proposed by Demmel (see [10], Alg. 5.3) that is implemented by the 



■A||f< £||A||p 



0[n) 

0(np) 
0(n) 
0(n) 

0(np) 

0(np2] 
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LAPACK procedure slaedS and has complexity O(p^). We conclude that approximation 
of rank-r dominant mode subspace of Gram matrix (9) in d-dimensional case requires 
0(nr^ + r'^+^) operations. 

The relation between accuracy of cross approximation of Gram matrices and corre- 
sponding low-rank approximation of initial matrices is given by the following theorem. 



Theorem 2. Consider a matrix U 



Ui U2 



. If the corresponding Gram matrix 



A = U^U 



All A12 
A21 A22 



allows the cross approximation 

A- 



Aii 
A21 



A 



-1 



All A12 



^ £||A||2, 



then there exists a matrix B such that 

||U-UiB^||2 ^ v^||U||2. 
Proof. Consider V = — Ui Aj^^ A12 + U2 and write 

yTy = A21 An^ All An^ A12 - A21 A,-; A12 - A21 A„^ A12 + A22 = A22 - A21 A,-; Au. 
Cross approximation is exact on the selected rows and columns 



(10) 



A 



All 
A21 



a; 



All A 
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A22- A2iAi;Ai2 





" 













(11) 



and it follows that ||VTy||2 ^ £||U^U||2 and ||y||2 < y£||U||2. We conclude that B"^ 



I A7M 



11 ^12 



provides (10). 



□ 



Remark 2. For U with U|Ui = I, UJU2 = el, U|U2 = 0, inequality (10) is sharp. 



Remark 3. For fixed Ui , matrix B^ 



I A^^^Ai2 



(U|Ui ) ^ U|U provides minimal 



residual U — UiB^ in Frobenius and spectral norms. See [14], where a nice estimates for 
accuracy of cross approximation of matrices and tensors are also given. 

Remark 4. spanB = spanX is the subspace of columns of the Gram matrix that 
support the cross approximation in Alg. 1. 

Since the spectral norm of the residual is not easy to evaluate, the stopping criteria 
in a practical algorithm is based on the Frobenius norm. On each step of Alg. 1 vector 
d contains the diagonal of residual (11) and 

n 

||d||i = Y_ = }^(V^V)(i,i) = ||V[i, 

i=l i 
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We can also implement stopping criteria based on eigenvalues stored in A. To do this, 
we can split them in 'dominant' and 'smaller' parts basing on desired tolerance e, and 
stop the process if during several iterations new eigenvalues fall into the smaller part. 
This criteria will approximate spectral norm more precisely, but as we see in numerical 
experiments, it generally does not differ from the Probenius-based one. 

Obviously, Alg. 1 can be applied in the same way to estimate other Tucker factors 
of (4). Due to roundoff errors, accuracy £ of Alg 1 is limited by machine precision tol, 
and for £ = tol, accuracy of (5) can be estimated by Thm. 2 as 

||F - F||2 ^ Vtoi^J c\[\l) + c\[V) + cl[W)\\?\\i> 
where C2(U) is defined in (8) and similar definition applies to V, W. 

4 Numerical examples 

Multidimensional data often appear in modern modelling programs. For example, in 
chemical packages, e.g. PC GAMESS, MOLPRO, the electron density function is given 
in canonical form (2) as a sum of tensor product of Gaussians, but with number of terms, 
that may be too large for practically feasible computations even for moderate molecules. 
In order to make computations efficient, further approximation (recompression) to the 
Tucker format can be performed. This problem was approached in [3] using Tucker- ALS 
algorithm, in [21] by Tucker- ALS with initial guess obtained from the coarser grids, 
in [12] by CrossSD algorithm, in [24] by individual cross approximation of canonical 
factors, in [27] by cross approximation of Gram matrices of unfoldings and in [15] by 
algorithms based on Wedderburn rank reduction. 

As an example, we apply the discussed algorithm for Hadamard multiplication of 
electron density given in Tucker format to themselves. This operation can be a building 
block for algorithm that computes pointwise cubic root of density, that is used in the 
Kohn-Sham model. A good initial guess for such methods can be evaluated by mimic 
algorithm [24]. 

The results of experiments are collected in Tab. 2. They were performed on Intel 
Xeon Quad-Core E5504 CPU running at 2.00 GHz using Intel Fortran compiler version 
11.1 and BLAS/LAPACK routines provided by MKL library. For each molecule, we 
show time in seconds T(Alg. 1) for evaluation of three dominant subspaces X[i, a], Y[j, |3] 
and Z[k, y] by Alg. 1 with accuracy of approximation of Gram matrices set to £ = 10^^^. 
Then we compute best core by convolution 

T[a, |3,y] = F[i,j,k] xi X[a,i] X2 Y[|3,j] X3 Z[y,k]. 

and check relative accuracy £(Alg. 1) of approximation (5) in Frobenius norm. The 
direct computation of all elements of residual requires a lot of computational time and 
the accuracy ||F — FHf was verified by comparing the result with Tucker approximation 
computed by CrossSD algorithm [23] with accuracy set to £ = 10^^^. The CrossSD 
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Table 2. Hadamard square of electron density, n-\ = 1x2 = 1x3 = 5121 



molecule 




T(Alg. 1) 


£(Alg. 


1) 


T(c3d) 


T(wsvdr) 


T(tals) 


£(tals) 


methane 


(74,74,74) 


4.0 


3.io~ 


-7 


78.6 


12.4 


37 


7.10-13 


ethane 


(67, 94,83) 


5.3 


6.10- 


-7 


76.8 


15.1 


42 


8.10-13 


ethanol 


(128,127,134) 


20 


5.io~ 


-7 


1050 


210 


473 


9.10-13 


glycine 


(62,176,186) 


38 


8.10- 


-7 


1260 


237 


442 


9.10-13 



algorithm was verified in [23, 12] by exhaustive check on parallel memory platforms, 
and can be considered as reliable answer. The residual between two Tucker formats is 
computed as proposed in [26]. 

Then we compute approximation of the same accuracy e(Alg. 1) by CrossSD [23] and 
WsvdR [15] algorithms and show the corresponding timings as T(c3d) and T(wsvdr). 
We also show time T(tals) for one iteration of Tucker- ALS [22, 6] with ranks fixed equal 
to the ranks of bases X, Y, Z, returned by Alg. 1. Then we apply one iteration of rank- 
revealing Tucker- ALS [24] with accuracy parameter set to £ = 10^^^ using bases X, Y, Z 
as initial guess, and show the accuracy of improved approximation by £(tals). 

We conclude that proposed algorithm is faster that other methods for this purpose 
and return approximation of dominant subspaces that allows to construct approximation 
with accuracy about square root of machine precision. Using the subspaces, computed 
by Alg. 1 as initial guess, rank revealing Tucker-ALS converges to almost machine 
precision in one iteration. 
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